Algorithm 3 Progressive Optimization with Center Loss
Input: The training dataset; the full-precision kernels $C$; the pre-trained kernels $tC$ from ternary PCNNs; the projection matrix $W$; the learning rates $\eta_1$ and $\eta_2$.
Output: The binary PCNNs based on the updated $C$ and $W$.
1: Initialize $W$ randomly and $C$ from $tC$;
2: repeat
3:    // Forward propagation
4:    for $l = 1$ to $L$ do
5:       $\hat{C}^l_{i,j} \leftarrow P(W, C^l_i)$; // using Eq. 3.43
6:       $D^l_i \leftarrow \mathrm{Concatenate}(\hat{C}_{i,j})$; // using Eq. 3.45
7:       Perform activation binarization; // using the sign function
8:       Perform traditional 2D convolution; // using Eqs. 3.46, 3.47, and 3.48
9:    end for
10:   Calculate the cross-entropy loss $L_S$;
11:   if using center loss then
12:      $L' = L_S + L_C$;
13:   else
14:      $L' = L_S$;
15:   end if
16:   // Backward propagation
17:   Compute $\delta_{\hat{C}^l_{i,j}} = \frac{\partial L'}{\partial \hat{C}^l_{i,j}}$;
18:   for $l = L$ to $1$ do
19:      // Calculate the gradients
20:      Calculate $\delta_{C^l_i}$; // using Eqs. 3.49, 3.51, and 3.52
21:      Calculate $\delta_{W^l_j}$; // using Eqs. 3.115, 3.116, and 3.56
22:      // Update the parameters
23:      $C^l_i \leftarrow C^l_i - \eta_1 \delta_{C^l_i}$; // Eq. 3.50
24:      $W^l_j \leftarrow W^l_j - \eta_2 \delta_{W^l_j}$; // Eq. 3.54
25:   end for
26:   Adjust the learning rates $\eta_1$ and $\eta_2$.
27: until the network converges
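To make the progressive optimization above more concrete, the following PyTorch sketch implements a single training step under simplifying assumptions: the projection $P(W, C)$ of Eq. 3.43 is reduced to a per-channel scaling followed by sign binarization, only one projection branch is used, and the discrete back-propagation rules (Eqs. 3.49–3.56) are approximated by a straight-through estimator. The names BinarySign, PCNNBlock, and train_step are hypothetical and introduced only for illustration.

```python
# Minimal sketch of one step of Algorithm 3 (not the chapter's exact formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarySign(torch.autograd.Function):
    """Sign binarization with a straight-through estimator in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only where |x| <= 1 (standard STE clipping).
        return grad_out * (x.abs() <= 1).float()


class PCNNBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Full-precision kernels C; in Algorithm 3 they are initialized from the ternary tC.
        self.C = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.05)
        # Projection matrix W, modeled here as a learnable per-output-channel scale (an assumption).
        self.W = nn.Parameter(torch.ones(out_ch, 1, 1, 1))

    def forward(self, x):
        C_hat = BinarySign.apply(self.W * self.C)   # projected, then binarized kernels
        x_bin = BinarySign.apply(x)                 # activation binarization (sign)
        return F.conv2d(x_bin, C_hat, padding=1)    # traditional 2D convolution


def train_step(block, head, x, y, centers, opt_C, opt_W,
               lam_center=0.01, use_center_loss=True):
    feat = block(x).mean(dim=(2, 3))                # pooled features used by the center loss
    logits = head(feat)
    loss = F.cross_entropy(logits, y)               # cross-entropy loss L_S
    if use_center_loss:
        # Center loss L_C: squared distance to the class centers (kept fixed here for simplicity).
        loss = loss + lam_center * (feat - centers[y]).pow(2).sum(dim=1).mean()
    opt_C.zero_grad(); opt_W.zero_grad()
    loss.backward()
    opt_C.step()                                    # C <- C - eta1 * dL'/dC
    opt_W.step()                                    # W <- W - eta2 * dL'/dW
    return loss.item()
```

Two separate optimizers keep the learning rates apart, e.g. `opt_C = torch.optim.SGD([block.C], lr=eta1)` and `opt_W = torch.optim.SGD([block.W], lr=eta2)`, matching steps 23–24 of the algorithm; both rates are then decayed as in step 26.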
3.5.8 Ablation Study
Parameter As mentioned above, the proposed projection loss, similar to clustering, can control the quantization process. We computed the distributions of the full-precision kernels and visualized the results in Figs. 3.14 and 3.15. The hyperparameter λ is designed to balance the projection loss and the cross-entropy loss. We vary it from 1e-3 to 1e-5 and finally set it to 0 in Fig. 3.14, where the variance of the kernel distribution increases with λ. When λ = 0, only one cluster is obtained, with the kernel weights tightly distributed around the threshold (= 0). This could result in instability during binarization, because even a little noise may flip a positive weight to negative and vice versa.
We also show how the projection loss shapes the kernel distribution during training in Fig. 3.15. A natural question is: do we always need a large λ? For this discrete optimization problem, the answer is no, as the experiment in Table 3.4 verifies: the projection loss and the cross-entropy loss should be considered simultaneously and balanced well. For example, when λ is set to 1e-4, the accuracy is higher than with other values. Thus, we fix λ to 1e-4 in the following experiments.
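As a small illustration of the balance discussed above, the sketch below combines the cross-entropy loss with a λ-weighted projection term. The quadratic distance to the nearest quantization level is only a stand-in for the chapter's actual projection loss, and `projection_loss` and `levels` are assumptions made for illustration.

```python
import torch


def projection_loss(C, W, levels=(-1.0, 1.0)):
    """Simplified stand-in for the projection loss: squared distance of the projected
    kernels W * C to the nearest quantization level (the chapter's loss may differ)."""
    proj = W * C
    dists = torch.stack([(proj - l).pow(2) for l in levels], dim=0)
    return dists.min(dim=0).values.mean()


# lam balances the projection term against the cross-entropy loss L_S;
# Sec. 3.5.8 fixes it to 1e-4 after sweeping 1e-3, 1e-4, 1e-5, and 0.
lam = 1e-4
# total_loss = cross_entropy_loss + lam * projection_loss(block.C, block.W)
```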
Learning convergence For PCNN-22 in Table 3.2, the PCNN model is trained for 200 epochs and then used to perform inference. In Fig. 3.16, we plot the training and test losses with λ = 0 and λ = 1e-4, respectively. It clearly shows that PCNNs with λ = 1e-4 (blue